Skip to content

perf(iterator): -25% on dgraph-shaped scans via lockless isBanned, hoisted fill, internal-key skip, KeyOnly opt-in, user-key lastKey#2285

Open
shaunpatterson wants to merge 8 commits into
dgraph-io:mainfrom
shaunpatterson:perf/dgraph-iter-bench
Open

perf(iterator): -25% on dgraph-shaped scans via lockless isBanned, hoisted fill, internal-key skip, KeyOnly opt-in, user-key lastKey#2285
shaunpatterson wants to merge 8 commits into
dgraph-io:mainfrom
shaunpatterson:perf/dgraph-iter-bench

Conversation

@shaunpatterson
Copy link
Copy Markdown

Summary

Five targeted optimizations to the iterator hot path, measured against a new dgraph-shaped benchmark suite. Composite improvement: −25.4% vs upstream/main on the 5-bench composite (sum of ns/op medians, Apple M4 Max, -benchtime=5s, median-of-3).

Benchmark Baseline Final Δ
PrefixScanKeyOnly 11.80M ns 8.15M ns −31.0%
PrefixScanKeyOnlyOpt (new, opt-in) 7.58M ns −35.8% vs baseline-equivalent
PrefixScanAllVersions 11.98M ns 10.55M ns −11.9%
KeyIteratorAllVersions 1,514 ns 1,376 ns −9.1%
NamespaceOffset/off 5.08M ns 3.78M ns −25.7%
NamespaceOffset/dgraph 5.51M ns 3.80M ns −31.0%
Composite 34.37M 25.71M −25.2%

What changed

Six commits behind the perf series + 2 doc/test commits:

  1. bench — Adds iterator_dgraph_bench_test.go with 5 benches that model dgraph's iteration patterns (NamespaceOffset=1, NumVersionsToKeep=MaxInt32, DetectConflicts=false). These are the metric for every iteration below.

  2. perf: lockless fast path for DB.isBanned when no namespaces banned — Adds monotonic atomic.Bool hasAny on lockedKeys. isBanned returns on a single atomic load when no namespaces are banned (the production steady state for tenants who enable NamespaceOffset for isolation but rarely ban). Eliminates RWMutex.RLock per iterator step. Sets the flag inside the same lock that guards the map, so any reader observing hasAny=true is guaranteed to see the populated map via release-acquire on the lock.

  3. perf: hoist mi.Value() and key out of fill()fill(item)fill(item, key, *y.ValueStruct). The KeyOnly path's parseItem previously made two mi.Value()+Decode calls per kept item. vs is passed by pointer to avoid copying the ~40-byte ValueStruct on every kept item.

  4. perf: skip per-step internal-key probe when prefix excludes badgerPrefix — Precompute canSeeInternalKeys once at iterator construction. When the user's opt.Prefix exists and its first byte differs from badgerPrefix[0], no key the iterator lands on can be a badger-internal key, so the per-step bytes.HasPrefix(key, badgerPrefix) probe is dead code.

  5. perf: IteratorOptions.KeyOnly skip per-item value SafeCopy — New opt-in IteratorOptions.KeyOnly. When set, fill skips the per-item SafeCopy(item.vptr, vs.Value). Item.Value / Item.ValueCopy return a new ErrKeyOnlyMode (strict-error, not silent-empty). PrefetchValues is forced off when KeyOnly=true. Captures an additional ~2–4% on top of the implicit gains for callers (e.g. dgraph's has() evaluator / index scans) that already discard item.vptr.

  6. perf: store user-key only in lastKey, inline same-key compareIterator.lastKey previously held the full key (user-key‖ts). Stores user-key only now. Replaces y.SameKey(lastKey, key) with an inlined len(key)-8 == len(lastKey) && bytes.Equal(key[:ukLen], lastKey). Avoids one ParseKey per SameKey and an 8-byte memcpy on each lastKey update. Same-key dedup invariant preserved exactly.

  7. docs(iterator): clarify FILL key-freshness and ukLen invariants — Strengthened comments on two hot-path invariants surfaced during code review.

  8. test: cover KeyOnly iterator paths and preserved-behavior regressions — New iterator_keyonly_test.go. Together with the benchmarks, every touched line in db.go and iterator.go is now at 100% coverage (verified via go tool cover).

Correctness

Every commit preserves existing semantics:

  • iter1: add() stores the flag inside the same lock that guards the map → release-acquire visibility holds.
  • iter2: key at every fill() call site is the current mi.Key(); the only goto FILL (reverse path) refreshes it after mi.Next().
  • iter3: When canSeeInternalKeys=false, isInternalKey stays false, which is the correct value for prefix shapes that cannot overlap badgerPrefix.
  • iter4: Misuse is loud — Item.Value returns ErrKeyOnlyMode rather than a silent empty slice. PrefetchValues is forced off so the prefetch goroutine never races fill.
  • iter5: lastKey user-key-only representation is preserved across the existing "b 7 (del), b 5" hazard because lastKey is still updated before the deleted-or-expired check.

Tests

  • All preserved-behavior tests (TestPreserved* / TestRegression*) pass on upstream/main AND on this branch — they were validated on both via a worktree.
  • New-API tests (TestKeyOnlyIterator_*, TestCanSeeInternalKeys) pass on this branch.
  • Full test suite (go test -run='^Test' ./...): PASS in 234.7s.
  • 100% line coverage on every line modified across db.go (isBanned + lockedKeys.add) and iterator.go (Item.Value/ValueCopy/EstimatedSize/ValueSize, NewIterator, canSeeInternalKeys, fill, parseItem).

Test plan

  • go test ./... — full suite passes
  • go test -bench=BenchmarkDgraph -benchtime=5s -count=3 . — composite −25.2% vs main
  • go tool cover -func=... — 100% on touched lines
  • Worktree validation: preserved-behavior tests also pass on upstream/main
  • CI runs (this PR will trigger them)

Notes for reviewers

  • IteratorOptions.KeyOnly is opt-in; existing callers see no behavior change. Dgraph would need a one-line change at its has() / index-scan iterator construction sites to capture the additional 2–4% on those paths.
  • The composite metric is sum-of-medians for easy single-number comparison; the largest absolute benchmark (PrefixScanAllVersions) dominates and hides smaller relative wins on KeyIteratorAllVersions. Per-benchmark breakdown above tells the full story.
  • No public API removals or breaking changes. The only public additions are IteratorOptions.KeyOnly (bool, default false) and ErrKeyOnlyMode (sentinel error).

🤖 Generated with Claude Code

shaunpatterson and others added 8 commits May 23, 2026 14:30
Add four benchmarks modeling dgraph's posting-layer iterator hot
paths. dgraph drives badger in a specific combination of options
that none of the existing root-package benchmarks exercise:

  - NamespaceOffset = 1 (DB.isBanned runs on every key)
  - NumVersionsToKeep = math.MaxInt32 (all versions retained)
  - DetectConflicts = false (dgraph owns OCC)
  - PrefetchValues = false (key-only scans in every hot path)
  - AllVersions = true (MVCC-aware reads)
  - Prefix-bounded forward iteration over mid-sized structured keys

Keys mirror dgraph's data-key layout (1B type + 8B namespace + 2B
attr-len + ~20B attr + 1B subtype + 8B UID = 40 bytes).

Benchmarks:

  - BenchmarkDgraphPrefixScanKeyOnly: models has() / index-scan
    paths (posting.MemoryLayer.IterateDisk, worker.sort.go)
  - BenchmarkDgraphPrefixScanAllVersions: models the rollup
    iterator with multiple MVCC versions per key
  - BenchmarkDgraphKeyIteratorAllVersions: models readFromDisk
    via NewKeyIterator over a single posting key (every cache miss)
  - BenchmarkDgraphPrefixScanNamespaceOffset: isolates the cost of
    NamespaceOffset=1 vs -1 (i.e. the per-key DB.isBanned tax)

These benchmarks are intended as the measurement baseline for
upcoming iterator hot-path optimizations.
DB.isBanned runs on every iterator step and every key access when
NamespaceOffset >= 0. The hot path acquires lockedKeys.RLock and probes
the keys map even when the ban set is empty (the steady-state case in
production deployments like dgraph that enable NamespaceOffset for
multi-tenant key isolation but rarely ban anything).

Add a monotonic atomic.Bool flag (lockedKeys.hasAny) that flips from
false to true on the first add() and never flips back. isBanned can
check this flag with a single atomic load and skip the lock + map
lookup when no ban has ever been recorded.

Correctness: add() stores the flag inside the same lock that guards
the keys map, so any reader observing hasAny=true is guaranteed to
see the populated map via release-acquire ordering on the lock.
Readers observing hasAny=false correctly return early because the
flag is monotonic and is set before any key is observable in the map.
DB-reopen replay path also calls add() so reopened DBs retain the
flag for previously-recorded bans.

Measured on dgraph-shaped iterator benchmarks (Apple M4 Max,
median-of-3, -benchtime=5s):

  PrefixScanKeyOnly        11.80M -> 10.56M ns/op  (-10.5%)
  PrefixScanAllVersions    11.98M -> 10.60M ns/op  (-11.4%)
  KeyIteratorAllVersions    1,514 ->  1,371 ns/op   (-9.4%)
  NamespaceOffset/dgraph    5.51M ->  4.96M ns/op  (-10.0%)
  NamespaceOffset/off       5.08M ->  5.01M ns/op   (-1.4%, noise)

The NamespaceOffset/dgraph case now matches NamespaceOffset/off
(both ~4.96-5.01M ns/op) -- the entire ~8.5% NamespaceOffset tax
from the per-key RLock is eliminated.

Tests: full suite passes (TestBannedPrefixes, TestIterateWithBanned
verified) except TestProtosRegenerate which fails identically on
baseline due to missing protoc binary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…decode

Iterator.parseItem previously called mi.Value() twice per kept item on
the non-AllVersions hot path:
  1. To pre-check Meta/ExpiresAt for the deleted/expired filter.
  2. Inside fill(), which re-fetched the same ValueStruct (cost includes
     an Arena getVal + ValueStruct.Decode that the CPU profile showed
     consuming ~40% of cumulative samples on dgraph-shaped key-only
     prefix scans).

Refactor fill to take the already-fetched key []byte and a pointer to
the already-decoded *y.ValueStruct, and hoist mi.Value() into both the
AllVersions and non-AllVersions branches of parseItem. fill no longer
touches it.iitr.

Pass vs by pointer (not value): y.ValueStruct is ~40 bytes; passing by
value added a per-item struct copy that wiped out the AllVersions win
(branch only had one Value() call, no double-decode to eliminate).
Pointer collapses to a single 8-byte arg; fill does not store the
pointer so escape analysis keeps vs on the stack.

Correctness:
- vs and key are fetched while the iterator is positioned on the
  current entry; no mi.Next() runs between fetch and fill. Arena/block
  memory backing vs.Value and key is stable for that window.
- In the reverse-iteration goto FILL path, mi has advanced once, so
  the refactor refreshes key = mi.Key() before re-entering FILL, and
  FILL re-fetches vs := mi.Value(). Semantically equivalent to the
  pre-refactor code.

Measured on dgraph-shaped iterator benchmarks (Apple M4 Max,
median-of-3, -benchtime=5s) -- composite (sum of 5 medians) drops
31.13M -> 26.37M ns (-15.3% vs iter1, -23.3% vs upstream baseline):

  PrefixScanKeyOnly        10.56M -> 8.18M ns/op   (-22.5%)
  PrefixScanAllVersions    10.60M -> 10.53M ns/op  (-0.7%, flat)
  KeyIteratorAllVersions    1,371 ->  1,376 ns/op  (+0.4%, noise)
  NamespaceOffset/off       5.01M -> 3.83M ns/op   (-23.6%)
  NamespaceOffset/dgraph    4.96M -> 3.83M ns/op   (-22.7%)

Tests: full suite passes (TestProtosRegenerate continues to fail
identically due to missing protoc binary).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Iterator.parseItem unconditionally ran bytes.HasPrefix(key, badgerPrefix)
on every step to detect badger-internal keys (e.g. !badger!banned). For
prefix-bounded scans whose Prefix's first byte differs from
badgerPrefix[0] ('!'), no key the iterator can land on can possibly be
internal -- the probe is dead. The CPU profile after iter2 attributed
~13% of cumulative samples to runtime.memequal, a portion of which was
this probe.

Precompute canSeeInternalKeys once at iterator construction:
  - empty prefix: true (unbounded scan can hit internal keys)
  - prefix[0] == '!' : true (could be the banned-ns recovery iterator
    or any scan that starts in the internal-key range)
  - otherwise: false (the iterator after Seek(prefix) is positioned
    past the internal-key range and hasPrefix() will reject any
    !-prefixed key anyway, so the per-step probe is provably dead).

When the flag is false, skip both the HasPrefix probe and the
!InternalAccess skip-block. isInternalKey defaults to false, which is
also the correct value for the downstream "!isInternalKey && isBanned"
gate (non-internal keys are subject to ban checks; internal ones are
not, but here we know they can't appear).

Correctness:
- initBannedNamespaces sets Prefix = bannedNsKey (starts with '!') so
  the flag is true and the probe still runs on the recovery path.
- All other badger-internal access uses InternalAccess=true with a
  badgerPrefix-anchored Prefix, so the flag is also true there.
- For dgraph's data scans (Prefix[0] = 0x00) and all other user scans
  the probe is provably dead and can be skipped.

Measured on dgraph-shaped iterator benchmarks (Apple M4 Max,
median-of-3, -benchtime=5s):

  PrefixScanKeyOnly        8.18M -> 7.92M ns/op  (-3.2%)
  PrefixScanAllVersions   10.53M ->10.50M ns/op  (-0.2%, noise)
  KeyIteratorAllVersions  1,376  -> 1,350 ns/op  (-1.9%)
  NamespaceOffset/off      3.83M -> 3.72M ns/op  (-2.6%)
  NamespaceOffset/dgraph   3.83M -> 3.74M ns/op  (-2.4%)

Cumulative since upstream baseline: composite 34.37M -> 25.89M ns
(-24.7%); PrefixScanKeyOnly 11.80M -> 7.92M (-32.9%);
NamespaceOffset/dgraph 5.51M -> 3.74M (-32.1%).

Tests: TestBannedPrefixes, TestIterateWithBanned, TestIterator*, and
TestKeyIterator* all pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add opt-in KeyOnly flag to IteratorOptions. When set, fill() skips the
per-item SafeCopy of vs.Value into item.vptr, which is the largest
remaining memmove on the key-only forward-scan hot path
(dgraph has()/index scans).

Trade-offs (Option A, strict):
- Item.Value / Item.ValueCopy return new ErrKeyOnlyMode.
- Item.ValueSize / Item.EstimatedSize report 0 / key-size only.
- Item.Key, Version, UserMeta, ExpiresAt, IsDeletedOrExpired unaffected.
- NewIterator forces PrefetchValues=false when KeyOnly=true so the
  prefetch goroutine never starts.

Existing callers see no behavior change. The new benchmark
BenchmarkDgraphPrefixScanKeyOnlyOpt exercises the opt-in path
alongside the original PrefixScanKeyOnly to preserve comparison.

Results (5-bench composite, median-of-3, Apple M4 Max):
- PrefixScanKeyOnlyOpt:  7.76M ns/op (−2.0% vs iter3 KeyOnly, −34.3% vs baseline)
- Composite with opt-in: 25.63M ns (−25.4% vs upstream baseline)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Iterator.lastKey previously stored the full key (user-key || 8B ts) but
SameKey only compared the user-key portion. Two consequences:

1. SafeCopy(it.lastKey, key) copied 8 bytes of dead ts every kept item.
2. y.SameKey called ParseKey on both src and dst (two slice ops + a
   function call) on the hot dedup compare.

Change: store the user-key only in lastKey; inline the compare.

  ukLen := len(key) - 8
  if ukLen == len(it.lastKey) && bytes.Equal(key[:ukLen], it.lastKey) {
      mi.Next()
      return false
  }
  it.lastKey = y.SafeCopy(it.lastKey, key[:ukLen])

The same-key dedup invariant ("skip later versions of a user-key we've
already considered, including tombstoned ones") is preserved exactly:
lastKey is still updated before the FILL/deleted-or-expired check.

Result: BenchmarkDgraphPrefixScanKeyOnlyOpt: 7.76M -> 7.58M ns (-2.3%).
Other benchmarks within run-to-run noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Strengthen comments on two hot-path invariants surfaced during code
review of the perf series:

1. FILL label entry contract: `key` is always mi.Key() at the current
   iitr position. The only goto FILL (reverse path) refreshes key after
   mi.Next(); the fall-through entry above never advances the iterator
   between `key := mi.Key()` and reaching FILL. This is what lets
   fill() safely reuse the caller-supplied key without re-calling
   mi.Key() per item.

2. lastKey ukLen safety: `ukLen := len(key) - 8` is safe because
   len(key) >= 8 is a badger-wide invariant (every LSM key is stored
   with an 8-byte timestamp suffix via y.KeyWithTs). y.ParseTs(key)
   above already relies on this — no defensive check needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a comprehensive test file that hits every line of the perf series
touched in db.go and iterator.go. Two test groups:

* TestKeyOnlyIterator_* — exercise the new KeyOnly opt-in path:
  Value/ValueCopy return ErrKeyOnlyMode; EstimatedSize returns key
  length; ValueSize returns 0; NewIterator forces PrefetchValues=false;
  Key/UserMeta/Version/ExpiresAt continue to work under KeyOnly.

* TestRegression* — lock in pre-existing behaviors that the perf series
  rewrote in place: same-key dedup across multiple MVCC versions
  (iter5 lastKey rewrite), internal-key hiding (iter3 canSeeInternalKeys
  gate), isBanned correctness across the new atomic.Bool fast path
  (iter1) including short-key-with-bans branch, and non-KeyOnly fill
  path correctness (iter2 hoist).

* TestCanSeeInternalKeys — table-driven unit test for the helper
  added in iter3.

All preserved-behavior tests also pass on upstream/main, proving the
expectations match the unchanged baseline. Together with the
benchmarks added in 7174cc2, the touched lines in db.go and
iterator.go now have 100% coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaunpatterson shaunpatterson requested a review from a team as a code owner May 23, 2026 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant